Release v0.5 · fixie-ai/ultravox

We're releasing Ultravox v0.5 today. The weights have been pushed to Hugging Face. If you're using the Ultravox Realtime APIs, v0.5 is the new default.

What's New

v0.5 improves upon v0.4.1 in the following ways:

60% improvement in transcription accuracy, with lower word error rates (WER) across 82 evaluation sets from LibriSpeech, CommonVoice, and Fleurs.
18% improvement in speech-based web question answering, particularly in handling named entities and fine-grained speech details.
24% improvement in X-to-English translation, as measured by BLEU across 19 languages.
Expanded language support from 15 to 42 languages, making it significantly more accessible for global applications.
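
For readers less familiar with the transcription metric mentioned above, WER is commonly defined as the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below illustrates that definition only. It is not the evaluation code behind the numbers in this release, the function name `word_error_rate` is made up for this example, and it assumes simple whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.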

42 Languages Supported

Arabic, Belarusian, Bengali, Bulgarian, Chinese, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hindi, Hungarian, Italian, Japanese, Latvian, Lithuanian, Macedonian, Marathi, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh.

Evals

Our primary evaluation is speech translation, measured by BLEU. New for v0.5, we also report Big Bench Audio, which measures general reasoning in response to audio input.

Ultravox 70B

	Ultravox 0.4.1 70B	Ultravox 0.5 70B
covost2 en_ar	19.64	20.21
covost2 en_de	32.47	34.53
covost2 es_en	40.76	43.29
covost2 ru_en	45.07	48.99
covost2 en_ca	37.58	40.01
covost2 zh_en	17.98	21.37
big bench audio	76.20	82.70

Ultravox 8B

	Ultravox 0.4.1 8B	Ultravox 0.5 8B
covost2 en_ar	12.28	12.99
covost2 en_ca	29.94	31.54
covost2 en_de	27.13	28.70
covost2 es_en	39.16	40.19
covost2 ru_en	39.65	42.13
covost2 zh_en	14.55	17.22
big bench audio	63.20	66.54
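
To make the tables easier to compare, the sketch below computes the absolute and relative change per row using the scores copied from the two tables above. The helper name `gains` and the dictionary layout are choices made for this example. Note that these per-row deltas are a separate calculation from the headline percentages in "What's New", which are reported over different evaluation sets.

```python
# (v0.4.1 score, v0.5 score) per benchmark, copied from the tables above.
SCORES_70B = {
    "covost2 en_ar": (19.64, 20.21),
    "covost2 en_de": (32.47, 34.53),
    "covost2 es_en": (40.76, 43.29),
    "covost2 ru_en": (45.07, 48.99),
    "covost2 en_ca": (37.58, 40.01),
    "covost2 zh_en": (17.98, 21.37),
    "big bench audio": (76.20, 82.70),
}

SCORES_8B = {
    "covost2 en_ar": (12.28, 12.99),
    "covost2 en_ca": (29.94, 31.54),
    "covost2 en_de": (27.13, 28.70),
    "covost2 es_en": (39.16, 40.19),
    "covost2 ru_en": (39.65, 42.13),
    "covost2 zh_en": (14.55, 17.22),
    "big bench audio": (63.20, 66.54),
}


def gains(scores):
    """Return {benchmark: (absolute change, relative change in percent)}."""
    return {
        name: (new - old, (new - old) / old * 100.0)
        for name, (old, new) in scores.items()
    }


if __name__ == "__main__":
    for label, scores in (("70B", SCORES_70B), ("8B", SCORES_8B)):
        print(f"Ultravox {label}")
        for name, (absolute, relative) in gains(scores).items():
            print(f"  {name:<16} {absolute:+6.2f}  ({relative:+5.1f}%)")
```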

Training

This version of Ultravox continues to use a frozen, pre-trained Llama core (3.1 for 8B and 3.3 for 70B), but we've significantly increased both the amount of training data and the overall training time. On 8xH100s, training takes about 100 hours for the 8B model and about 150 hours for the 70B model.

What's Changed

Audio streaming training with masking by @saeeddhqan in #148
Defining block size in UltravoxConfig, and solving assertions by @saeeddhqan in #157
Gradio demo for real-time conversations with WebRTC by @freddyaboulton in #150
Fix "AttributeError: 'NoneType' object has no attribute 'tokenizer'" by @farzadab in #173
docs: update README.md by @eltociear in #174
Update ultravox model and config for v0.5 by @farzadab in #276

New Contributors

@saeeddhqan made their first contribution in #148
@freddyaboulton made their first contribution in #150
@eltociear made their first contribution in #174

Full Changelog: v0.4.1...v0.5
